A continuation of the PCA analysis of the wine dataset: k-means clustering and hierarchical clustering
PCA is an exploratory data analysis tool and may also be used for feature engineering and/or clustering. This post continues the clustering analysis of the wines dataset from the kohonen package: I carry out k-means clustering using the tidymodels framework, as well as hierarchical clustering using the factoextra package.
Use only continuous variables and ensure that there are no missing data. Determine the number of components using eigenvalues, scree plots and parallel analysis.
The loadings show the linear combinations of the original variables that make up each new dimension.
The scores show the coordinates of the individual wine samples in the new low-dimensional space.
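Of these, parallel analysis is not shown in this post. As a minimal sketch (assuming the psych package, which is not otherwise used here), it could be run as follows:
library(psych)
data(wines, package = "kohonen")
# compare the observed eigenvalues against eigenvalues from resampled data;
# fa = "pc" restricts the analysis to principal components
fa.parallel(scale(wines), fa = "pc")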
This dataset is from the kohonen package. It contains 177 rows and 13 columns.
These data are the results of chemical analyses of wines grown in the same region in Italy (Piedmont) but derived from three different cultivars: Nebbiolo, Barbera and Grignolino grapes. The wine from the Nebbiolo grape is called Barolo. The data contain the quantities of several constituents found in each of the three types of wines, as well as some spectroscopic variables.
PCA was performed in an earlier post; k-means clustering and hierarchical clustering analysis (HCA) will be built upon the PCA scores.
library(tidyverse)   # dplyr, ggplot2, tidyr, purrr
library(tidymodels)  # recipes, broom
library(gridExtra)   # grid.arrange()
library(factoextra)  # fviz_nbclust(), hcut()
library(kohonen)     # wines dataset (data(wines) also loads vintages)
data(wines)
wines <- as.data.frame(wines) %>% # clean_names() requires a data.frame
  janitor::clean_names() %>%
  as_tibble()
glimpse(wines) # does not contain the type of wine (Y variable)
Rows: 177
Columns: 13
$ alcohol <dbl> 13.20, 13.16, 14.37, 13.24, 14.20, 14.39, …
$ malic_acid <dbl> 1.78, 2.36, 1.95, 2.59, 1.76, 1.87, 2.15, …
$ ash <dbl> 2.14, 2.67, 2.50, 2.87, 2.45, 2.45, 2.61, …
$ ash_alkalinity <dbl> 11.2, 18.6, 16.8, 21.0, 15.2, 14.6, 17.6, …
$ magnesium <dbl> 100, 101, 113, 118, 112, 96, 121, 97, 98, …
$ tot_phenols <dbl> 2.65, 2.80, 3.85, 2.80, 3.27, 2.50, 2.60, …
$ flavonoids <dbl> 2.76, 3.24, 3.49, 2.69, 3.39, 2.52, 2.51, …
$ non_flav_phenols <dbl> 0.26, 0.30, 0.24, 0.39, 0.34, 0.30, 0.31, …
$ proanth <dbl> 1.28, 2.81, 2.18, 1.82, 1.97, 1.98, 1.25, …
$ col_int <dbl> 4.38, 5.68, 7.80, 4.32, 6.75, 5.25, 5.05, …
$ col_hue <dbl> 1.05, 1.03, 0.86, 1.04, 1.05, 1.02, 1.06, …
$ od_ratio <dbl> 3.40, 3.17, 3.45, 2.93, 2.85, 3.58, 3.58, …
$ proline <dbl> 1050, 1185, 1480, 735, 1450, 1290, 1295, 1…
Refer to the earlier post on PCA of the wines dataset for the full PCA workflow.
The dataset does not include the y variable (type of wine), so the update_role() step is omitted.
step_normalize() combines step_center() and step_scale()
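For reference, a minimal sketch of the equivalent two-step specification:
# equivalent to step_normalize(all_predictors())
recipe(~ ., data = wines) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())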
Note that step_pca() is the second step and is tagged with id = "pca", which will be used to retrieve the PCA results later.
wines_recipe <- recipe(~ ., data = wines) %>%
# update_role(vintages, new_role = "id") %>% # skipped
# step_naomit(all_predictors()) %>%
step_normalize(all_predictors()) %>%
step_pca(all_predictors(), id = "pca")
wines_recipe # 13 predictors
Data Recipe
Inputs:
role #variables
predictor 13
Operations:
Centering and scaling for all_predictors()
No PCA components were extracted.
wines_prep <- prep(wines_recipe)
wines_prep # trained
Data Recipe
Inputs:
role #variables
predictor 13
Training data contained 177 data points and no missing data.
Operations:
Centering and scaling for alcohol, malic_acid, ... [trained]
PCA extraction with alcohol, malic_acid, ... [trained]
tidy_pca_loadings <- wines_prep %>%
tidy(id = "pca")
tidy_pca_loadings # values here are the loadings
# A tibble: 169 x 4
terms value component id
<chr> <dbl> <chr> <chr>
1 alcohol -0.138 PC1 pca
2 malic_acid 0.246 PC1 pca
3 ash 0.00432 PC1 pca
4 ash_alkalinity 0.237 PC1 pca
5 magnesium -0.135 PC1 pca
6 tot_phenols -0.396 PC1 pca
7 flavonoids -0.424 PC1 pca
8 non_flav_phenols 0.299 PC1 pca
9 proanth -0.313 PC1 pca
10 col_int 0.0933 PC1 pca
# … with 159 more rows
wines_bake <- bake(wines_prep, wines)
wines_bake # contains the PCA scores used for HCA and k-means clustering
# A tibble: 177 x 5
PC1 PC2 PC3 PC4 PC5
<dbl> <dbl> <dbl> <dbl> <dbl>
1 -2.22 -0.301 -2.03 0.281 -0.259
2 -2.52 1.06 0.974 -0.734 -0.198
3 -3.74 2.80 -0.180 -0.575 -0.257
4 -1.02 0.886 2.02 0.432 0.274
5 -3.04 2.16 -0.637 0.486 -0.630
6 -2.45 1.20 -0.985 0.00466 -1.03
7 -2.06 1.64 0.143 1.20 0.0105
8 -2.51 0.958 -1.78 -0.104 -0.871
9 -2.76 0.822 -0.986 -0.374 -0.437
10 -3.48 1.35 -0.428 -0.0399 -0.316
# … with 167 more rows
Only the scree plot is shown below. Refer to the PCA analysis of wine post for other options for determining the number of PCs.
# b. Scree plot/Variance plot
wines_prep %>%
tidy(id = "pca", type = "variance") %>%
filter(terms == "percent variance") %>%
ggplot(aes(x = component, y = value)) +
geom_point(size = 2) +
geom_line(size = 1) +
scale_x_continuous(breaks = 1:13) +
labs(title = "% Variance explained",
y = "% total variance",
x = "PC",
caption = "Source: Wines dataset from kohonen package") +
theme_classic() +
theme(axis.title = element_text(face = "bold", size = 12),
axis.text = element_text(size = 10),
plot.title = element_text(size = 14, face = "bold")) # 2 or 3
plot_loadings <- tidy_pca_loadings %>%
filter(component %in% c("PC1", "PC2", "PC3")) %>%
mutate(terms = tidytext::reorder_within(terms,
abs(value),
component)) %>%
ggplot(aes(abs(value), terms, fill = value > 0)) +
geom_col() +
facet_wrap(~ component, scales = "free_y") +
tidytext::scale_y_reordered() + # strips the "___facet" suffix that reorder_within() appends
scale_fill_manual(values = c("deepskyblue4", "darkorange")) +
labs( x = "absolute value of contribution",
y = NULL,
fill = "Positive?",
title = "PCA Loadings Plot",
subtitle = "Number of PC should be 3, compare the pos and the neg",
caption = "Source: ChemometricswithR") +
theme_minimal() +
theme(title = element_text(size = 24, face = "bold"),
axis.text = element_text(size = 16),
axis.title = element_text(size = 20))
plot_loadings
# PC1: flavonoids, tot_phenols, od_ratio, proanthocyanidins, col_hue; 36%
# PC2: col_int, alcohol, proline, ash, magnesium; 19.2%
# PC3: ash, ash_alkalinity, non_flav_phenols; 11.2%
# define arrow style
arrow_style <- arrow(angle = 30,
length = unit(0.2, "inches"),
type = "closed")
# get pca loadings into wider format
pca_loadings_wider <- tidy_pca_loadings %>%
  pivot_wider(names_from = component, values_from = value, id_cols = terms)
pca_loadings_only <- pca_loadings_wider %>%
ggplot(aes(x = PC1, y = PC2)) +
geom_segment(aes(xend = PC1, yend = PC2),
x = 0,
y = 0,
arrow = arrow_style) +
ggrepel::geom_text_repel(aes(x = PC1, y = PC2, label = terms),
hjust = 0,
vjust = 1,
size = 5,
color = "red") +
labs(title = "Loadings on PCs 1 and 2 for normalized data") +
theme_classic()
# Scores plot #####
# PCA SCORES are in bake
pc1pc2_scores_plot <- wines_bake %>%
  mutate(vintages = vintages) %>% # wine types; vintages is loaded alongside wines by data(wines)
  ggplot(aes(PC1, PC2)) +
  geom_point(aes(color = vintages, shape = vintages),
             alpha = 0.8, size = 2) +
scale_color_manual(values = c("deepskyblue4", "darkorange", "purple")) +
labs(title = "Scores on PCs 1 and 2 for normalized data",
x = "PC1 (36%)",
y = "PC2 (19.2%)") +
theme_classic() +
theme(legend.position = "none")
grid.arrange(pc1pc2_scores_plot, pca_loadings_only, ncol = 2)
The PCA scores from the bake step above will be used for the clustering analysis.
There are three common ways of determining the number of clusters: the gap statistic, the elbow method (total within-cluster sum of squares) and the average silhouette method. Let us look at all three of them.
gap_statistic <- cluster::clusGap(wines_bake,
                                  FUNcluster = kmeans,
                                  nstart = 50,
                                  K.max = 10, # max number of clusters
                                  B = 1000)   # number of bootstrap samples
factoextra::fviz_gap_stat(gap_statistic) # theoretically should have only 3 clusters
fviz_nbclust(wines_bake,
kmeans,
method = "wss") # this suggests 3 clusters, in line with theory
fviz_nbclust(wines_bake,
             FUNcluster = hcut,
             method = "silhouette") # this suggests 3 clusters
All three methods agree that there should be 3 clusters. This may not always be the case. In any case, we know that there are 3 different types of wine in the dataset.
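Since the wine types are known, the k = 3 solution can be sanity-checked directly. Below is a minimal sketch (my addition, not from the original analysis) that fits k-means with k = 3 on the PCA scores and cross-tabulates the cluster assignments against vintages, the wine-type factor that data(wines) loads alongside wines:
set.seed(2021) # k-means depends on random starts
km_final <- kmeans(wines_bake, centers = 3, nstart = 50)
table(vintages, km_final$cluster) # rows: wine types; columns: cluster labels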
# exploring different k numbers #####
kclusts_explore <- tibble(k = 1:10) %>%
mutate(kclust = purrr::map(k, ~kmeans(wines_bake, .x)),
tidied = purrr::map(kclust, tidy),
glanced = purrr::map(kclust, glance),
augmented = purrr::map(kclust, augment, wines_bake))
kclusts_explore
# A tibble: 10 x 5
k kclust tidied glanced augmented
<int> <list> <list> <list> <list>
1 1 <kmeans> <tibble [1 × 8]> <tibble [1 × 4]> <tibble [177 × 6…
2 2 <kmeans> <tibble [2 × 8]> <tibble [1 × 4]> <tibble [177 × 6…
3 3 <kmeans> <tibble [3 × 8]> <tibble [1 × 4]> <tibble [177 × 6…
4 4 <kmeans> <tibble [4 × 8]> <tibble [1 × 4]> <tibble [177 × 6…
5 5 <kmeans> <tibble [5 × 8]> <tibble [1 × 4]> <tibble [177 × 6…
6 6 <kmeans> <tibble [6 × 8]> <tibble [1 × 4]> <tibble [177 × 6…
7 7 <kmeans> <tibble [7 × 8]> <tibble [1 × 4]> <tibble [177 × 6…
8 8 <kmeans> <tibble [8 × 8]> <tibble [1 × 4]> <tibble [177 × 6…
9 9 <kmeans> <tibble [9 × 8]> <tibble [1 × 4]> <tibble [177 × 6…
10 10 <kmeans> <tibble [10 × 8]> <tibble [1 × 4]> <tibble [177 × 6…
# turn this into 3 separate datasets, each representing a
# different type of data
#
clusters <- kclusts_explore %>%
unnest(cols = c(tidied))
clusters
# A tibble: 55 x 12
k kclust PC1 PC2 PC3 PC4 PC5
<int> <list> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 <kmea… -1.03e-15 -2.78e-16 -3.22e-16 -5.39e-17 1.76e-16
2 2 <kmea… 1.91e+ 0 -8.65e- 2 -1.73e- 3 7.28e- 2 -1.41e- 2
3 2 <kmea… -1.89e+ 0 8.56e- 2 1.71e- 3 -7.20e- 2 1.39e- 2
4 3 <kmea… 2.71e+ 0 1.10e+ 0 -2.35e- 1 -6.17e- 2 7.64e- 2
5 3 <kmea… 4.43e- 4 -1.76e+ 0 1.85e- 1 -7.36e- 2 7.54e- 2
6 3 <kmea… -2.26e+ 0 9.55e- 1 -5.83e- 4 1.30e- 1 -1.44e- 1
7 4 <kmea… 2.90e- 1 -1.77e+ 0 -8.54e- 1 4.55e- 1 1.35e- 1
8 4 <kmea… 2.76e+ 0 1.21e+ 0 -1.52e- 1 -1.11e- 1 5.10e- 2
9 4 <kmea… -2.39e+ 0 1.04e+ 0 -2.57e- 1 1.29e- 1 -2.19e- 1
10 4 <kmea… -3.41e- 1 -1.30e+ 0 1.28e+ 0 -4.39e- 1 1.17e- 1
# … with 45 more rows, and 5 more variables: size <int>,
# withinss <dbl>, cluster <fct>, glanced <list>, augmented <list>
#
assignments <- kclusts_explore %>%
unnest(cols = c(augmented))
assignments # can be used to plot, with each point colored according to predicted cluster
# A tibble: 1,770 x 10
k kclust tidied glanced PC1 PC2 PC3 PC4 PC5
<int> <list> <list> <list> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 <kmea… <tibb… <tibbl… -2.22 -0.301 -2.03 0.281 -0.259
2 1 <kmea… <tibb… <tibbl… -2.52 1.06 0.974 -0.734 -0.198
3 1 <kmea… <tibb… <tibbl… -3.74 2.80 -0.180 -0.575 -0.257
4 1 <kmea… <tibb… <tibbl… -1.02 0.886 2.02 0.432 0.274
5 1 <kmea… <tibb… <tibbl… -3.04 2.16 -0.637 0.486 -0.630
6 1 <kmea… <tibb… <tibbl… -2.45 1.20 -0.985 0.00466 -1.03
7 1 <kmea… <tibb… <tibbl… -2.06 1.64 0.143 1.20 0.0105
8 1 <kmea… <tibb… <tibbl… -2.51 0.958 -1.78 -0.104 -0.871
9 1 <kmea… <tibb… <tibbl… -2.76 0.822 -0.986 -0.374 -0.437
10 1 <kmea… <tibb… <tibbl… -3.48 1.35 -0.428 -0.0399 -0.316
# … with 1,760 more rows, and 1 more variable: .cluster <fct>
#
clusterings <- kclusts_explore %>%
unnest(cols = c(glanced))
clusterings
# A tibble: 10 x 8
k kclust tidied totss tot.withinss betweenss iter augmented
<int> <list> <list> <dbl> <dbl> <dbl> <int> <list>
1 1 <kmean… <tibbl… 1834. 1834. -2.50e-12 1 <tibble […
2 2 <kmean… <tibbl… 1834. 1192. 6.42e+ 2 1 <tibble […
3 3 <kmean… <tibbl… 1834. 820. 1.01e+ 3 2 <tibble […
4 4 <kmean… <tibbl… 1834. 730. 1.10e+ 3 3 <tibble […
5 5 <kmean… <tibbl… 1834. 659. 1.18e+ 3 3 <tibble […
6 6 <kmean… <tibbl… 1834. 603. 1.23e+ 3 4 <tibble […
7 7 <kmean… <tibbl… 1834. 562. 1.27e+ 3 3 <tibble […
8 8 <kmean… <tibbl… 1834. 545. 1.29e+ 3 5 <tibble […
9 9 <kmean… <tibbl… 1834. 471. 1.36e+ 3 3 <tibble […
10 10 <kmean… <tibbl… 1834. 465. 1.37e+ 3 4 <tibble […
# visualize
# number of clusters
clusterings %>% # from glance
ggplot(aes(k, tot.withinss)) + # total within cluster sum of squares, keep low
geom_line() +
geom_point() +
scale_x_continuous(breaks = 1:10) +
labs(title = "Plot of Total Within Sum of Squares for different number of clusters",
subtitle = "Additional clusters beyond k = 3 have little value") +
theme_classic()
# how datapoints are separated
glimpse(assignments)
Rows: 1,770
Columns: 10
$ k <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ kclust <list> [<1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ tidied <list> [<tbl_df[1 x 8]>, <tbl_df[1 x 8]>, <tbl_df[1 x 8]…
$ glanced <list> [<tbl_df[1 x 4]>, <tbl_df[1 x 4]>, <tbl_df[1 x 4]…
$ PC1 <dbl> -2.223934, -2.524760, -3.744056, -1.017245, -3.040…
$ PC2 <dbl> -0.30145757, 1.05925179, 2.79737289, 0.88586726, 2…
$ PC3 <dbl> -2.0271695, 0.9739613, -0.1798599, 2.0181445, -0.6…
$ PC4 <dbl> 0.281108579, -0.733645703, -0.575492236, 0.4315681…
$ PC5 <dbl> -0.25880549, -0.19804048, -0.25714173, 0.27445613,…
$ .cluster <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
assignments %>% # from augment
ggplot(aes(x = PC1, y = PC2)) + # use PCA data
geom_point(aes(color = .cluster), size = 2, alpha = 0.8) +
facet_wrap( ~ k) +
# to see the center of the clusters
geom_point(data = clusters, size = 9, shape = "x") +
labs(x = "PC1 (36% variance)",
y = "PC2 (19.2% variance)",
title = "Visualization of k-means clustering",
subtitle = "Optimal k = 3",
caption = "Source: Wines dataset from kohonen package") +
theme_minimal()
Initially, I was stuck at the visualization part for k-means clustering, as I didn't know how to bring in my x- and y-axis data. I had been using the original dataset all along, and was wondering how plots created using factoextra::fviz_cluster() could report Dim1 on the x-axis and Dim2 on the y-axis. I finally had the eureka moment when I realised I should use the PCA scores from the bake step earlier.
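For reference, a minimal sketch of that factoextra shortcut (my addition; km3 is a hypothetical object name):
# fviz_cluster() reduces the data to two dimensions for plotting,
# labelling the axes Dim1 and Dim2
km3 <- kmeans(wines_bake, centers = 3, nstart = 50)
factoextra::fviz_cluster(km3, data = wines_bake)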
I really like how the tidymodels workflow allows for visualizing how the clusters separate when different values of k are used. The functions augment(), tidy() and glance() were very efficient in reporting the results of k-means clustering. Previously I had only used tidy() and glance() for regression, and I didn't know they could be extended to cluster analysis as well.
Lastly, I find dendrograms very aesthetically intuitive, and I like how the colors and types of dendrograms can be customised (a sketch follows below). However, the assumption is that there is some structure in the data in the first place; otherwise HCA would give very misleading results.
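As a minimal sketch of such a dendrogram on the PCA scores (my addition, with illustrative parameter choices), using hcut() and fviz_dend() from factoextra:
# hierarchical clustering on the PCA scores, cut into 3 clusters (Ward linkage)
wines_hcut <- factoextra::hcut(wines_bake, k = 3, hc_method = "ward.D2")
factoextra::fviz_dend(wines_hcut,
                      rect = TRUE, # draw rectangles around the clusters
                      k_colors = c("deepskyblue4", "darkorange", "purple"))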